\section{Supervised Aggregate Extraction Models}
\label{sec:relation extraction:aggregate}
All the approaches introduced thus far are sentential.
They map each sample to a relation individually, without modeling the interactions between samples.
In contrast, this section focuses on aggregate approaches (Equation~\ref{eq:relation extraction:aggregate definition}).
Aggregate approaches explicitly model the connections between samples.
The most common aggregate method is to ensure the consistency of relations predicted for a given entity pair \(\vctr{e}\in\entitySet^2\) by processing together all sentences \(s\in\sentenceSet\) mentioning \(\vctr{e}\).
To this end, we define \(\dataSet^\vctr{e}\) to be the dataset \(\dataSet\) grouped by entity pairs.
Thus, instead of containing a sample \(x=(s, \vctr{e})\), the dataset \(\dataSet^\vctr{e}\) contains bags of mentions \(\vctr{x}=\{(s, \vctr{e}), (s', \vctr{e}), \dotsc\}\) of the same entity pair \(\vctr{e}\).
Most aggregate methods are built upon sentential approaches and provide a sentential assignment.
Thus, more often than not, each sample is still mapped to a relation.
Consequently, the evaluations of aggregate methods follow the evaluations of sentential approaches introduced in Section~\ref{sec:relation extraction:supervised evaluation}.

\subsection{Label Propagation}
\label{sec:relation extraction:label propagation}
To deal with the shortage of manually labeled data, one approach is to use labels weakly correlated with the samples, as in distant supervision (Section~\ref{sec:relation extraction:distant supervision}).
Another approach is to label a small subset of the dataset but leave most samples unlabeled.
This is the semi-supervised approach.
The bootstrapped models (Section~\ref{sec:relation extraction:bootstrap}) can also be seen as semi-supervised approaches: a small number of labeled samples are given to the model, which then crawls the web to obtain new unsupervised samples.
The evaluation of semi-supervised models follows the one of supervised models described in Section~\ref{sec:relation extraction:supervised evaluation}.
The difference between the two lies in the fact that unsupervised samples can be used to gain a better estimate of the input distribution in the semi-supervised setting, while fully-supervised models cannot make use of unsupervised samples.

Apart from bootstrapped models, one of the first semi-supervised relation extraction systems was proposed by \textcitex{label_propagation_re}.
They build their model on top of hand-engineered features (Section~\ref{sec:relation extraction:hand-designed features}) compared using a similarity function.
This is somewhat similar to kernel approaches (Section~\ref{sec:relation extraction:kernel}), except that this function does not need to be positive semidefinite.
Given all samples in feature space, the labels from the supervised samples are propagated to the neighboring unlabeled samples using the label propagation algorithm \parencite{label_propagation}, listed as Algorithm~\ref{alg:relation extraction:label propagation}.
This propagation takes the form of a convex combination of other samples' labels weighted by the similarity function.
Let \(\operatorname{sim}\) denote this sample similarity function:
\begin{equation*}
	\operatorname{sim}\colon (\sentenceSet\times\entitySet^2)\times(\sentenceSet\times\entitySet^2)\to\symbb{R}.
\end{equation*}
The label propagation algorithm builds a pairwise similarity matrix between labeled and unlabeled samples, which is column-normalized and then row-normalized:
\begin{marginalgorithm}
	\input{mainmatter/relation extraction/label propagation.tex}
	\scaption[The label propagation algorithm.]{
		The label propagation algorithm.
		The notation \(\delta_{a,b}\) is a Kronecker delta, equal to \(1\) if \(a=b\) and to \(0\) otherwise.
		The two loops assigning to \(y_{ij}\) simply enforce that the relations assigned to the labeled samples do not deviate from their gold values.
		\label{alg:relation extraction:label propagation}
	}
\end{marginalgorithm}
\begin{equation}
	t_{ij} \propto \frac{\exp\big(\operatorname{sim}(x_i, x_j)\big)}{\displaystyle \sum_{x_k\in \dataSet \cup \dataSet_\relationSet} \exp\big(\operatorname{sim}(x_k, x_j)\big)} \quad \text{for } i,j\in\{1,\dotsc,|\dataSet|+|\dataSet_\relationSet|\}
	\label{eq:relation extraction:label propagation transition}
\end{equation}
The relation assigned to each unlabeled sample is then recomputed by aggregating the labels---whether these labels come from \(\dataSet_\relationSet\) or were computed at a previous iteration---of all other samples weighted by \(\mtrx{T}\).
Note that labels assigned to samples coming from \(\dataSet_\relationSet\) are not altered.
This operation is repeated until the label assignment stabilizes.
This label propagation algorithm has been shown to converge to a unique solution \parencite{label_propagation}.

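The propagation loop can be sketched in a few lines of NumPy. This is an illustrative implementation, not the thesis code: `sim` stands for the similarity function defined above, the labeled rows are clamped to their gold one-hot labels at every iteration, and all names are ours.

```python
import numpy as np

def label_propagation(X_labeled, y_labeled, X_unlabeled, n_relations,
                      sim, n_iter=100, tol=1e-6):
    """Illustrative sketch of the label propagation algorithm."""
    X = list(X_labeled) + list(X_unlabeled)
    n, n_l = len(X), len(X_labeled)

    # Pairwise similarity matrix, exponentiated then column- and row-normalized.
    S = np.array([[sim(xi, xj) for xj in X] for xi in X])
    T = np.exp(S)
    T /= T.sum(axis=0, keepdims=True)  # column normalization
    T /= T.sum(axis=1, keepdims=True)  # then row normalization

    # Initial label matrix: one-hot for labeled samples, uniform otherwise.
    Y = np.full((n, n_relations), 1.0 / n_relations)
    Y[:n_l] = np.eye(n_relations)[y_labeled]

    for _ in range(n_iter):
        Y_new = T @ Y                                 # propagate labels
        Y_new[:n_l] = np.eye(n_relations)[y_labeled]  # clamp gold labels
        Y_new /= Y_new.sum(axis=1, keepdims=True)
        if np.abs(Y_new - Y).max() < tol:             # assignment stabilized
            break
        Y = Y_new
    return Y[n_l:].argmax(axis=1)  # predicted relations for unlabeled samples
```

With a toy one-dimensional feature space and a negative-distance similarity, each unlabeled point inherits the label of its nearest labeled neighbor, as expected.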
\Textcite{label_propagation_re} tried two similarity functions: the cosine similarity and the Jensen--Shannon divergence of the feature vectors.
They evaluated their approach on the \textsc{ace}~2003 dataset (Section~\ref{sec:datasets:ace}) using different fractions of the labels, showing that while their model performed roughly on par with others when using the whole dataset, it decisively outperformed other methods when using a small number of labels.

\subsection{Multi-instance Multi-label}
\label{sec:relation extraction:miml}
Following the popularization of distant supervision by \textcite{distant}, training datasets gained in volume but lost in quality (see Section~\ref{sec:relation extraction:distant supervision}).
In order to create models more resilient to the large number of false positives in distantly-supervised datasets, multi-instance approaches~\parencite{multi-instance} started to gain traction.

In the article of \textcite{distant}, all mentions of the same entity pair are viewed as a single sample to make a prediction.
Their model is a simple logistic classifier on top of hand-engineered features, which could only predict a single relation label per entity pair.
However, when aggregating the features of all mentions and supervising with a single relation, \textcite{distant} backpropagate to all features, i.e.~the parameters used by all mentions are modified.
This assumes that all mentions should convey the relation.
To avoid this assumption, the more sophisticated multi-instance assumption is used:
\begin{assumption}[multiinstance]{multi-instance}
	All facts \((\vctr{e}, r)\in\kbSet\) are conveyed by at least one sentence of the unlabeled dataset \(\dataSet\).

	\smallskip
	\noindent
	\(\forall (e_1, e_2, r)\in\kbSet : \exists (s, e_1, e_2)\in\dataSet : (s, e_1, e_2) \text{ conveys } \tripletHolds{e_1}{r}{e_2}\)
\end{assumption}

MultiR \parencitex{multir} follows such a multi-instance setup but also models multiple relations and thus does not assume \hypothesis{1-adjacency}, unlike all the models introduced thus far.
Figure~\ref{fig:relation extraction:miml setup} illustrates this setup, which is dubbed \textsc{miml} (multi-instance multi-label) following the subsequent work of \textcite{miml}.

\begin{marginfigure}
	\centering
	\input{mainmatter/relation extraction/miml setup.tex}
	\scaption[Multi-instance multi-label (\textsc{miml}) setup.]{
		Multi-instance (\(n>1\)) multi-label (\(m>1\)) setup.
		Each entity pair appears in several instances and the two entities are linked by several relations.
		\label{fig:relation extraction:miml setup}
	}
\end{marginfigure}

MultiR uses a latent variable \(z\) to capture the sentential extraction.
That is, for each sentence \(x_i\in\dataSet_\relationSet\), the latent variable \(\rndm{z}_i\in\relationSet\) captures the relation conveyed by \(x_i\).
Furthermore, for a given entity pair \(\vctr{e}\in\entitySet^2\), for all \(r\in\relationSet\), a binary classifier \(y_r\) is used to predict whether this pair is linked by \(r\).
In this fashion, multiple relations can be predicted for the same entity pair.
The model can be summarized by the plate diagram of Figure~\ref{fig:relation extraction:multir plate}.
\begin{marginfigure}
	\centering
	\input{mainmatter/relation extraction/multir plate.tex}
	\scaption[MultiR plate diagram.]{
		MultiR plate diagram.
		The symbol \tikz{\node[pdiag factor]{};} denotes factor nodes.
		\label{fig:relation extraction:multir plate}
	}
\end{marginfigure}
Let us define \(\dataSet_\relationSet^\vctr{e}\) as the dataset \(\dataSet_\relationSet\) where samples are grouped by entity pairs.
Since multiple relations can link the same entity pair, we will use \(\vctr{y}\in \{0, 1\}^{\relationSet}\) to refer to the binary vector indexing the conveyed relations.
Formally, MultiR defines the probability of the sentential (\(\vctr{z}\)) and aggregate (\(\vctr{y}\)) assignments for a mention bag (\(\vctr{x}\)) as follows:
\begin{equation}
	P(\vctr{y}, \vctr{z}\mid \vctr{x}; \vctr{\theta}) \propto \prod_{r\in\relationSet} \vctr{\phi}^\text{join}(y_r, \vctr{z}) \prod_{x_i\in\vctr{x}} \vctr{\phi}^\text{extract}(z_i, x_i; \vctr{\theta})
	\label{eq:relation extraction:multir}
\end{equation}
where \(\vctr{\phi}^\text{join}\) simply aggregates the predictions for all mentions:
\begin{equation*}
	\vctr{\phi}^\text{join}(y_r, \vctr{z}) =
	\begin{cases}
		1 & \text{if \(y_r=1 \land \exists i : z_i=r\)} \\
		1 & \text{if \(y_r=0 \land \nexists i : z_i=r\)} \\
		0 & \text{otherwise}
	\end{cases}
\end{equation*}
and \(\vctr{\phi}^\text{extract}\) is the exponential of a weighted sum of several hand-designed features:
\begin{equation*}
	\vctr{\phi}^\text{extract}(z_i, x_i; \vctr{\theta}) = \exp\left(
		\sum_{\text{feature \(j\)}} \theta_j \phi_j(z_i, x_i)
	\right)
\end{equation*}

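As a toy illustration of this factorization, the sketch below computes the unnormalized joint score of an assignment. The join factor is implemented as a deterministic OR (\(y_r\) must equal the disjunction of the sentential indicators), and the feature functions are hypothetical indicators, not MultiR's actual hand-designed features.

```python
import math

def phi_join(y_r, z, r):
    # Deterministic-OR join factor: y_r = 1 exactly when some mention conveys r.
    return 1.0 if (y_r == 1) == any(z_i == r for z_i in z) else 0.0

def phi_extract(z_i, x_i, theta, features):
    # Log-linear extraction factor over hand-designed features phi_j(z_i, x_i).
    return math.exp(sum(t * f(z_i, x_i) for t, f in zip(theta, features)))

def joint_score(y, z, x, theta, features, relations):
    """Unnormalized P(y, z | x; theta) of the MultiR factorization (sketch)."""
    score = 1.0
    for r in relations:
        score *= phi_join(y[r], z, r)       # one join factor per relation
    for z_i, x_i in zip(z, x):
        score *= phi_extract(z_i, x_i, theta, features)  # one factor per mention
    return score
```

Any aggregate assignment \(\vctr{y}\) inconsistent with the sentential assignment \(\vctr{z}\) receives a score of zero, which is how the hard constraint enters the distribution.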
We now describe the training algorithm used by MultiR, which is listed as Algorithm~\ref{alg:relation extraction:multir}.
Following the multi-instance setup, MultiR assumes that every fact \((e_1, r, e_2)\in\kbSet\) is conveyed by at least one mention \((s, e_1, e_2)\in\dataSet\).
This can be seen in the first product of Equation~\ref{eq:relation extraction:multir}: if even a single gold relation is not predicted for any sentence, the whole probability mass function drops to 0.
This means that during inference, each relation \(r\) conveyed in the knowledge base must be covered by at least one sentential extraction \(z\).
\begin{marginparagraph}
	In particular, note that if an entity pair is linked by more relations than it has mentions in the text, the algorithm collapses since each mention conveys a single relation.
\end{marginparagraph}
Given all sentences \(\vctr{x}_i\subseteq\dataSet\) containing an entity pair \((e_1, e_2)\), when the model does not predict the actual set of relations \(\vctr{y}_i=\{\,r \mid (e_1, r, e_2)\in\kbSet\,\}\), the parameters \(\vctr{\theta}\) must be tuned such that every relation \(r\in\vctr{y}_i\) is conveyed by at least one sentence, as expressed by the line:
\begin{algorithm}[t]
	\centering
	\begin{minipage}{7cm}
		\input{mainmatter/relation extraction/multir.tex}
	\end{minipage}
	\scaption*[The MultiR training algorithm.]{
		The MultiR training algorithm.
		For each bag of mentions \(\vctr{x}_i\), the most likely sentential and aggregate predictions \((\vctr{y}', \vctr{z}')\) are made.
		If the predicted relations are different from the true relations \(\vctr{y}_i\) linking the two entities, the parameters \(\vctr{\theta}\) are adjusted such that \(\vctr{z}\) covers all relations in \(\vctr{y}_i\).
		\label{alg:relation extraction:multir}
	}
\end{algorithm}
\begin{equation*}
	\vctr{z}^*\gets \argmax_{\vctr{z}} P(\vctr{z}\mid \vctr{x}_i, \vctr{y}_i; \vctr{\theta}).
\end{equation*}
This can be reframed as a weighted edge-cover problem, where the edge weights are given by \(\vctr{\phi}^\text{extract}(z_i, x_i; \vctr{\theta})\).
The MultiR training algorithm can be seen as maximizing the likelihood \(P(\vctr{y}\mid \vctr{x}; \vctr{\theta})\) where a Viterbi approximation was used---the expectations being replaced with maxima.

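To make the covering constraint concrete, here is a minimal greedy sketch of this constrained inference over hypothetical log scores. It only illustrates the constraint; MultiR solves the underlying weighted edge-cover problem exactly rather than greedily.

```python
def constrained_assignment(scores, gold):
    """Greedy sketch of z* = argmax_z P(z | x, y) under the covering constraint.

    `scores[i][r]` holds the log extract score of mention i conveying
    relation r; `gold` lists the relations that must each be covered by at
    least one mention (it must not be larger than the mention bag).
    """
    # Unconstrained start: best gold relation for each mention.
    z = [max(gold, key=lambda r: scores[i][r]) for i in range(len(scores))]
    # Repair: cover each missing relation with the mention whose
    # reassignment loses the least total score.
    for r in gold:
        if r not in z:
            i = min(range(len(z)), key=lambda k: scores[k][z[k]] - scores[k][r])
            z[i] = r
    return z
```

In the margin-note situation where the gold relations outnumber the mentions, no assignment can cover them all, which is exactly the collapse described above.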
The phrase multi-instance multi-label (\textsc{miml}) was introduced by \textcitex{miml}.
Their approach is similar to that of MultiR except that they train a classifier for \(\vctr{\phi}^\text{join}\) instead of using a deterministic process.
Their training procedure also differs: they train in the Bayesian framework using an expectation--maximization algorithm.
In general, \textsc{miml} approaches are challenging to evaluate systematically since they suffer from low precision due to incomplete knowledge bases.
In particular, they were not compared with traditional supervised approaches.
For reference, \textcite{miml} compare the three methods mentioned in this section on the same datasets and observe that at the threshold at which recall goes over 30\%, the precision falls under 30\%.

\subsection{Universal Schemas}
\label{sec:relation extraction:universal schemas}
Another important weakly-supervised model is the universal schema approach designed by \textcitex{universal_schemas}.
In their setting, existing relations and surface forms linking two entities are considered to be of the same nature.
Slightly departing from their terminology, we refer to the union of relations (\(\relationSet\)) and surface forms (\(\sentenceSet\)) by the term ``items'' (\(\itemSet=\relationSet\cup\sentenceSet\)) for their similarity with the collaborative filtering concept.
\Textcite{universal_schemas} consider that entity pairs are linked by items such that the dataset available can be referred to as \(\dataSet_\itemSet\subseteq\entitySet^2\times\itemSet\).
This can be obtained by taking the union of an unlabeled dataset \(\dataSet\) and a knowledge base \(\kbSet\).
This dataset \(\dataSet_\itemSet\) can be seen as a matrix with entity pairs corresponding to rows and items corresponding to columns.
With this in mind, relation extraction resembles collaborative filtering.
Figure~\ref{fig:relation extraction:universal schema matrix} gives an example of this matrix, which we will call \(\mtrx{M}\in\symbb{R}^{\entitySet^2\times\itemSet}\).

\begin{figure}[ht!]
	\centering
	\input{mainmatter/relation extraction/universal schema.tex}
	\scaption[Universal schema matrix.]{
		Universal schema matrix.
		Observed entity--item pairs are shown in green, blue cells are unobserved values, while orange cells are unobserved values for which a prediction was made.
		The observed values on the left (surface forms) come from an unsupervised dataset \(\dataSet\), while the observed values on the right (relations) come from a knowledge base \(\kbSet\).
		\label{fig:relation extraction:universal schema matrix}
	}
\end{figure}

\Textcite{universal_schemas} propose to model this matrix using a combination of three models.
One of them is a low-rank matrix factorization:
\begin{equation*}
	m^\text{F}_{ei} = \sum_{j=1}^d u_{ej} v_{ij}
\end{equation*}
where \(d\) is a hyperparameter, and \(\mtrx{U}\in\symbb{R}^{\entitySet^2\times d}\) and \(\mtrx{V}\in\symbb{R}^{\itemSet\times d}\) are model parameters.
The two other models are an inter-item neighborhood model and selectional preferences (described in Section~\ref{sec:context:selectional preferences}), which we do not detail here.
Training such a model is difficult since we do not have access to negative facts: not observing a sample \((\vctr{e}, i)\not\in\dataSet_\itemSet\) does not necessarily imply that this sample is false.
To cope with this issue, \textcite{universal_schemas} propose to use the Bayesian personalized ranking model (\textsc{bpr}, \citex{bpr}).
Instead of enforcing each element \(m_{ei}\) to be equal to \(1\) or \(0\), \textsc{bpr} relies upon a ranking objective pushing elements observed to be true to be ranked higher than unobserved elements.
This is done through a contrastive objective between observed positive samples and unobserved negative samples drawn from a uniform distribution:
\begin{equation*}
J_\textsc{us}(\vctr{\theta}) =
	\sum_{(\vctr{e}^+,i)\in\dataSet_\itemSet}
	\sum_{\substack{(\vctr{e}^-,i)\in\entitySet^2\times\itemSet\\(\vctr{e}^-,i)\not\in\dataSet_\itemSet}}
	\log \sigma(m_{e^+i} - m_{e^-i})
\end{equation*}
This objective can be directly maximized using stochastic gradient ascent.
\Textcite{universal_schemas} experiment on a \(\textsc{nyt}+\textsc{fb}\) dataset: the unsupervised dataset \(\dataSet\) comes from the New York Times (\textsc{nyt}, Section~\ref{sec:datasets:nyt}) and the knowledge base \(\kbSet\) is Freebase (\textsc{fb}, Section~\ref{sec:datasets:freebase}).
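In practice, the double sum is approximated by sampling one observed cell and one unobserved cell at a time. The following NumPy sketch performs such stochastic ascent steps on the factorization model alone (the neighborhood and selectional-preference components are omitted, and all names are illustrative).

```python
import numpy as np

def bpr_epoch(U, V, observed, rng, lr=0.1, steps=1000):
    """Stochastic gradient ascent on the BPR objective J_us (sketch).

    U: entity-pair embeddings (one row per pair), V: item embeddings,
    `observed` the list of observed (entity_pair, item) index pairs.
    Negative entity pairs are drawn uniformly among unobserved cells.
    """
    observed_set = set(observed)
    for _ in range(steps):
        e_pos, i = observed[rng.integers(len(observed))]
        # Uniformly sample an unobserved entity pair for the same item.
        while True:
            e_neg = int(rng.integers(U.shape[0]))
            if (e_neg, i) not in observed_set:
                break
        margin = U[e_pos] @ V[i] - U[e_neg] @ V[i]     # m_{e+ i} - m_{e- i}
        g = 1.0 - 1.0 / (1.0 + np.exp(-margin))        # d log(sigma(margin)) / d margin
        dV = g * (U[e_pos] - U[e_neg])
        U[e_pos] += lr * g * V[i]
        U[e_neg] -= lr * g * V[i]
        V[i] += lr * dV
```

After training, observed cells of \(\mtrx{M}\) should be ranked above unobserved ones for the same item, which is all that the ranking objective demands.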

\subsection{Aggregate \textsc{pcnn} Extraction}
\label{sec:relation extraction:pcnn aggregate}
\textsc{pcnn} is a sentence-level feature extractor introduced in Section~\ref{sec:relation extraction:pcnn}.
\Textcitex{pcnn} introduce the \textsc{pcnn} feature extractor together with a multi-instance learning algorithm.
Given a bag of mentions \(\vctr{x}\in\dataSet^\vctr{e}\), for each mention \(x_i\in\vctr{x}\), they model \(P(\rndm{r}\mid x_i; \vctr{\theta})\).
However, the optimization is done over each bag of mentions separately:
\begin{align}
	\symcal{L}_\textsc{pcnn}(\vctr{\theta}) & = - \sum_{(\vctr{x}, r)\in\dataSet^\vctr{e}_\relationSet} \log P(r\mid x^*; \vctr{\theta})
	\label{eq:relation extraction:pcnn loss} \\
	x^* & = \argmax_{x_i\in \vctr{x}} P(r\mid x_i; \vctr{\theta})
	\label{eq:relation extraction:pcnn argmax}
\end{align}
In other words, for a bag of mentions \(\vctr{x}\) of an entity pair, the network backpropagates only on the sample that predicts a relation with the highest certainty.
Thus \textsc{pcnn} is a multi-instance single-relation model: it assumes \hypothesis{multi-instance} but also \hypothesis{1-adjacency}.

\Textcite{pcnn} reuse the experimental setup of \textcite{miml}, i.e.~using a distantly supervised dataset, but complement it with a manual evaluation to have a better estimate of the precision.

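The per-bag selection of the two equations above is easy to state in code. In this sketch, the PCNN itself is abstracted away: we assume its softmax outputs for each mention are already computed, and only show which mention contributes to the loss.

```python
import numpy as np

def pcnn_bag_loss(mention_probs, r):
    """Multi-instance selection of the PCNN loss (illustrative sketch).

    `mention_probs` is an (n_mentions, n_relations) array whose row i holds
    P(. | x_i; theta), i.e. the softmax output of the PCNN for mention x_i;
    `r` is the relation index given by distant supervision.  Only x*, the
    mention most confident about r, contributes to the loss and gradient.
    """
    i_star = int(np.argmax(mention_probs[:, r]))  # x* = argmax_i P(r | x_i)
    return -np.log(mention_probs[i_star, r]), i_star
```

All other mentions in the bag receive no gradient at this step, which is precisely what makes the model robust to noisy mentions but also single-relation.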
\Textcitex{pcnn_attention} improve the \textsc{pcnn} model with an attention mechanism over mentions to replace the \(\argmax\) of Equation~\ref{eq:relation extraction:pcnn argmax}.
The attention mechanism's memory is built from the output of the \textsc{pcnn} on each mention without applying a softmax; the \textsc{pcnn} is simply used to produce a representation for each mention.
Equations~\ref{eq:relation extraction:pcnn loss} and~\ref{eq:relation extraction:pcnn argmax} are then replaced by:
\begin{align*}
	\symcal{L}_\text{Lin}(\vctr{\theta}) & = - \sum_{(\vctr{x}, r)\in\dataSet^\vctr{e}_\relationSet} \log P(r\mid \vctr{x}; \vctr{\theta}) \\
	P(r\mid \vctr{x}; \vctr{\theta}) & \propto \exp( \mtrx{W} \vctr{s}(\vctr{x}, r) + \vctr{b} ) \\
	\vctr{s}(\vctr{x}, r) & = \sum_{x_i\in\vctr{x}} \alpha_i \operatorname{\textsc{pcnn}}(x_i)
\end{align*}
where the \(\alpha_i\) are attention weights computed from a bilinear product between the query \(r\) and the memory \(\operatorname{\textsc{pcnn}}(\vctr{x})\), similarly to the setup of Section~\ref{sec:context:attention}.
\Textcite{pcnn_attention} show that this modification improves the results of \textsc{pcnn}.
This can be seen as a relaxation of \hypothesis{multi-instance}: the standard \textsc{pcnn} approach assumes that each fact in \(\kbSet\) is conveyed by a single sentence through its \(\argmax\); in contrast, the attention approach simply assumes that all facts are conveyed in \(\dataSet\), at least by one sentence but possibly by several ones.

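The attentive aggregation can be sketched as follows, again assuming precomputed \textsc{pcnn} representations. For brevity the bilinear form of the attention score is collapsed into a plain dot product with a relation query vector; this simplification and all names are ours.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_bag_logits(H, q_r, W, b):
    """Selective attention over a mention bag (illustrative sketch).

    H: (n_mentions, d) PCNN representations of the mentions, q_r: (d,)
    query vector for the candidate relation, W: (n_relations, d),
    b: (n_relations,).  Returns the logits whose softmax gives
    P(. | bag; theta).
    """
    alpha = softmax(H @ q_r)   # attention weights over mentions
    s = alpha @ H              # bag representation s(x, r)
    return W @ s + b           # unnormalized log-probabilities
```

Unlike the \(\argmax\) version, every mention receives a gradient here, weighted by how relevant the attention deems it to the queried relation.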
\subsection{Entity Pair Graph}
\label{sec:relation extraction:epgnn}
The multi-instance approach shares information at the entity pair level.
However, information could also be shared between different entity pairs.
This is the idea put forth by the entity pair graph neural network (\textsc{epgnn}, \citex{epgnn}).
The basic sharing unit becomes the entity: when two mentions \((s, e_1, e_2), (s', e_1', e_2')\in\dataSet\) share at least one entity (\(\{e_1, e_2\}\cap\{e_1', e_2'\}\neq\emptyset\)), their features interact with each other in order to make a prediction.
The sharing of information follows an entity pair graph that links together bags of mentions with a common entity, as illustrated in Figure~\ref{fig:relation extraction:entity pair graph}.

\begin{figure}[ht!]
	\centering
	\input{mainmatter/relation extraction/entity pair graph.tex}
	\scaption[Entity pair graph.]{
		Entity pair graph.
		Each node corresponds to a bag of mentions; each edge of the graph corresponds to an entity in common between the two bags, and the edges are labeled with the shared entity.
		For illustration purposes, we show a single sample per bag.
		This example is from the SemEval 2010 Task 8 dataset (described in Section~\ref{sec:datasets:semeval}).
		All sentences convey the \textsl{entity-destination} relation.
		\label{fig:relation extraction:entity pair graph}
	}
\end{figure}

To obtain a distributed representation for a sentence, \textsc{epgnn} uses \textsc{bert} (Section~\ref{sec:context:transformers}).
More precisely, it combines the embedding of the \textsc{cls} token%
\sidenote{
	As a reminder, the \textsc{cls} token is the marker for the beginning of the sentence; its embedding is meant to represent the whole sentence.
}
with the embeddings corresponding to the two entities through a mean pooling.
The sentence feature extraction architecture is illustrated by Figure~\ref{fig:relation extraction:epgnn sentence representation}.
This is one of several methods to obtain an entity-aware fixed-size representation of a tagged sentence; other approaches are developed in Section~\ref{sec:relation extraction:mtb sentential}.
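The pooling scheme can be sketched as follows, assuming the contextualized \textsc{bert} token embeddings are already computed. The projection matrices stand in for the post-pooling linear layers mentioned in the figure caption; their exact shapes and the placement of the \(\ReLU\) are illustrative.

```python
import numpy as np

def epgnn_sentence_repr(token_embs, e1_span, e2_span, W_cls, W_ent, W_out):
    """Entity-aware sentence representation in the style of EPGNN (sketch).

    token_embs: (seq_len, d) contextualized BERT outputs with the cls token
    at position 0; e1_span and e2_span: (start, end) token indices of the
    two entity mentions.
    """
    cls = W_cls @ token_embs[0]                                   # whole sentence
    e1 = W_ent @ token_embs[e1_span[0]:e1_span[1]].mean(axis=0)   # entity 1 mean pooling
    e2 = W_ent @ token_embs[e2_span[0]:e2_span[1]].mean(axis=0)   # entity 2 mean pooling
    return np.maximum(W_out @ np.concatenate([cls, e1, e2]), 0.0) # final ReLU
```

Note that the contextualized embeddings of all other tokens are discarded: only the \textsc{cls} position and the two entity spans contribute to the representation.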

\begin{figure}[ht!]
	\centering
	\input{mainmatter/relation extraction/epgnn sentence representation.tex}
	\scaption[\textsc{epgnn} sentence representation.]{
		\textsc{epgnn} sentence representation.
		``Bentham'' was split into two subword tokens, ``Ben-'' and ``-tham'', by the \textsc{bpe} algorithm described in Section~\ref{sec:context:bpe}.
		The contextualized embeddings of most words are ignored.
		The final representation is only built using the entity spans and the \textsc{cls} token.
		Not appearing on the figure are linear layers used to post-process the output of the mean poolings and the final representation, as well as a \(\ReLU\) non-linearity.
		Compare to Figure~\ref{fig:relation extraction:emes}.
		\label{fig:relation extraction:epgnn sentence representation}
	}
\end{figure}

Given a vector representation for each sentence in the dataset, we can label the vertices of the entity pair graph.
A spectral graph convolutional network (\textsc{gcn}, Section~\ref{sec:graph:spectral gcn}) is then used to aggregate the information of its neighboring samples into each vertex.
Thus, \textsc{epgnn} produces two representations for a sample: one sentential and one topological.
From these two representations, a prediction is made using a linear and softmax layer.
Since a single relation is produced for each sample, \textsc{epgnn} is trained using the usual classification cross-entropy loss.
More details on graph-based approaches are given in Chapter~\ref{chap:graph}.

\Textcite{epgnn} evaluate \textsc{epgnn} on two datasets, SemEval~2010 Task~8 (Section~\ref{sec:datasets:semeval}) and \textsc{ace}~2005 (Section~\ref{sec:datasets:ace}), reaching a half-directed macro-\(\overHalfdirected{\fone}\) of 90.2\% on the first and a micro-\fone{} of 77.1\% on the second.
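The combination of the two representations can be sketched with a single symmetric-normalized \textsc{gcn} layer over the entity pair graph. This is a generic sketch of the architecture just described, not \textsc{epgnn}'s exact implementation; matrix names are ours.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One spectral GCN convolution (sketch): ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))     # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)

def epgnn_logits(A, H, W_gcn, W_out, b):
    """Concatenate sentential (H) and topological (GCN) features per sample,
    then apply the final linear layer; a softmax of the result gives the
    relation distribution for each vertex."""
    T = gcn_layer(A, H, W_gcn)                   # topological representation
    return np.concatenate([H, T], axis=1) @ W_out + b
```

Here `A` is the adjacency matrix of the entity pair graph (two vertices are adjacent when their bags share an entity) and each row of `H` is the sentence representation of one bag.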